For the ONT samples, (Unmodified) raw sequences were extracted from fast5 files, trimmed using chopper to remove low quality bases from the reads. Alignments were carried out using minimap2 against the mitochondrial chromosome (rCRS) and the chr22 autosome. Due to the circular nature of the mitochondrial chromosome, reads were aligned in parallel to an alternative version of the rCRS whose start-end positions were shifted by 8,000 bp. This allowed long reads spanning over the arbitrary start-end regions in the rCRS to be properly aligned. Only primary alignments were considered (SAM flags 0 and 16).
SNV calling was carried out using GATK Mutect2 in mitochondria mode for each of the three alignments (rCRS, shifted rCRS and chr22). The variants in the VCF resulting from the alignments of long reads against the 8,000 bp-shifted rCRS were lifted over to the original rCRS coordinates. A final set of mtDNA variants was obtained by combining variants within positions 4,000-11,999 from the standard rCRS reference, and within 1-3,999 and 12,000-16,569 from the alignment against the 8,000 bp-shifted reference. Haplocheck was used to test for haplotype contamination in the resulting filtered mtDNA variants.
To use as a nuclear control, we used the same methodology to call SNVs in a short region from the chr22 (chr22:20,000,000-20,100,000).
To obtain deletions at the single-read level from the ONT data, we parsed the sam CIGAR strings in the mitomap2 alignments using a custom script. Similarly, we parsed CIGAR strings to obtain the number of matching (=) and mismatching (X) positions in each single-read alignment.
See PIPELINE_short.sh for the commands used. Scripts and executables are located in the bin/ folder.
The two EDTA blood samples sequenced using Illumina NGS technology were aligned against the whole genome using the GATK recommended guidelines, following the mitochondrial-specific pipeline for variant calling.
Metadata with sample information:
Metadata <- read_csv("./Tables/Metadata.txt", col_types = "cccc") %>%
mutate(tissue = ifelse(str_starts(tissue, "edta_blood"), "edta_blood", tissue)) %>%
mutate(platform = seq) %>%
select(-seq)
Metadata %>%
arrange(platform, tissue) %>%
kable(caption = "Sample overview") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| sample_id | tissue | individual_id | platform |
|---|---|---|---|
| SL443862 | edta_blood | c | illumina |
| SL443863 | edta_blood | d | illumina |
| 6226-CT-0010 | PFC | a | ont |
| 6226-CT-0018 | cerebellum | b | ont |
| 6226-CT-0015 | colon | b | ont |
| 6226-CT-0016 | edta_blood | c | ont |
| 6226-CT-0017 | edta_blood | d | ont |
| 6226-CT-0011 | heart | b | ont |
| 6226-CT-0014 | kidney | b | ont |
| 6226-CT-0013 | liver | b | ont |
| 6226-CT-0012 | muscle | b | ont |
NOTE: sample from PFC is from a different individual (a), and EDTA blood samples (both Illumina and ONT sequencing samples) are yet from different individuals (c and d).
The following plot shows the rCRS coverage for all ONT samples (including those belonging to different individuals and tissues). Different colours correspond to different individuals.
Depth of coverage is calculated using samtools depth on both mtDNA alignments (rCRS and shifted rCRS), but positions are subsetted to 4,000-12,000 for the rCRS and 4,000-12,569 for the shifted rCRS, which correspond to chrM:1-4000 and chrM:12,000-16,569 when shifted back).
Depth of coverage for chr22 is restricted to chr22:20,000,000-20,100,000 for simplicity.
depthFiles1 <- paste0("../results/depth/",
filter(Metadata, platform == "ont")$sample_id,
"_chrM.txt")
names(depthFiles1) <- filter(Metadata, platform == "ont")$sample_id
#all(file.exists(depthFiles1))
depthFiles2 <- paste0("../results/depth/",
filter(Metadata, platform == "ont")$sample_id,
"_chrM_shifted.txt")
names(depthFiles2) <- filter(Metadata, platform == "ont")$sample_id
#all(file.exists(depthFiles2))
depthFiles22 <- paste0("../results/depth/",
filter(Metadata, platform == "ont")$sample_id,
"_chr22.txt")
names(depthFiles22) <- filter(Metadata, platform == "ont")$sample_id
#all(file.exists(depthFiles22))
Cov <- bind_rows(
lapply(names(depthFiles1), function(sid) {
read_tsv(depthFiles1[sid], col_names = FALSE, col_types = "cii") %>%
mutate(sample_id = sid) %>%
select(Pos = X2, Cov = X3, sample_id) %>%
filter(Pos >= 4000, Pos < 12000)
}) %>% Reduce("bind_rows", .),
lapply(names(depthFiles2), function(sid) {
read_tsv(depthFiles1[sid], col_names = FALSE, col_types = "cii") %>%
mutate(sample_id = sid) %>%
select(Pos = X2, Cov = X3, sample_id) %>%
filter(Pos >= 4000, Pos < 12569) %>%
mutate(Pos = sapply(Pos, us8k))
}) %>% Reduce("bind_rows", .)
) %>% filter(Pos != 3107) %>%
arrange(sample_id, Pos)
plotLabels <- left_join(Cov, select(Metadata, sample_id, tissue, individual_id), by = join_by(sample_id)) %>%
group_by(tissue) %>%
filter(Pos == 16569) %>%
ungroup()
Plot1 <- left_join(Cov, Metadata, by = join_by(sample_id)) %>%
ggplot() +
geom_smooth(aes(x = Pos, y = Cov, group = sample_id, colour = individual_id),
method = "gam", formula = y ~ s(x, bs = "cs", fx = TRUE, k = 100), se = FALSE) +
geom_text_repel(data = plotLabels, aes(x = Pos, y = Cov, label = tissue),
nudge_x = 1000, direction = "y",
box.padding = 0.5, point.padding = 0.5,
segment.color = "grey50", size = 5) +
scale_y_log10() +
theme_cowplot() +
theme(legend.position = "none") +
ggtitle("Absolute coverage rCRS")
Plot1# invisible(svg("./Figs/coverage_ONT_all.svg", height = 6, width = 12))
# Plot1
# invisible(dev.off())
invisible(png("./Figs/coverage_ONT_all.png", height = 1000, width = 2000, res = 150))
Plot1
invisible(dev.off())The high levels of coverage seen in the the mitochondrial chromosome contrast with the more sparse nuclear coverage. Based on the 100,000bp region from chr22, the nuclear genome has a coverage of 1-3 orders of magnitude lower.
In addition, the nuclear coverage is much more similar across different tissues than the mitochondrial, which exhibits very large differences (e.g., ~400X in EDTA blood vs ~200,000X in heart tissue).
The following table displays summary statistics for both the rCRS coverage (top) and the 100,000bp region in chr22 (bottom).
Cov22 <- lapply(names(depthFiles22), function(sid) {
read_tsv(depthFiles22[sid], col_names = FALSE, col_types = "cii") %>%
mutate(sample_id = sid) %>%
select(Pos = X2, Cov = X3, sample_id)
}) %>% Reduce("bind_rows", .)
summTable <- left_join(Cov, Metadata, by = join_by(sample_id)) %>%
group_by(sample_id) %>%
summarise(mean_cov = round(mean(Cov)),
stdev_cov = round(sd(Cov))) %>%
ungroup() %>%
left_join(Metadata, by = join_by(sample_id)) %>%
select(tissue, individual_id, sample_id, mean_cov, stdev_cov)
summTable22 <- left_join(Cov22, Metadata, by = join_by(sample_id)) %>%
group_by(sample_id) %>%
summarise(mean_cov = round(mean(Cov)),
stdev_cov = round(sd(Cov))) %>%
ungroup() %>%
left_join(Metadata, by = join_by(sample_id)) %>%
select(tissue, individual_id, sample_id, mean_cov, stdev_cov)
summTable %>%
kable(caption = "Summary statistics rCRS coverage per sample") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| tissue | individual_id | sample_id | mean_cov | stdev_cov |
|---|---|---|---|---|
| PFC | a | 6226-CT-0010 | 7456 | 407 |
| heart | b | 6226-CT-0011 | 199966 | 9417 |
| muscle | b | 6226-CT-0012 | 17489 | 723 |
| liver | b | 6226-CT-0013 | 18876 | 1559 |
| kidney | b | 6226-CT-0014 | 12790 | 574 |
| colon | b | 6226-CT-0015 | 3321 | 135 |
| edta_blood | c | 6226-CT-0016 | 436 | 44 |
| edta_blood | d | 6226-CT-0017 | 452 | 55 |
| cerebellum | b | 6226-CT-0018 | 3705 | 178 |
summTable22 %>%
kable(caption = "Summary statistics chr22:20,000,000-20,100,000 coverage per sample") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| tissue | individual_id | sample_id | mean_cov | stdev_cov |
|---|---|---|---|---|
| PFC | a | 6226-CT-0010 | 63 | 105 |
| heart | b | 6226-CT-0011 | 111 | 201 |
| muscle | b | 6226-CT-0012 | 81 | 155 |
| liver | b | 6226-CT-0013 | 60 | 105 |
| kidney | b | 6226-CT-0014 | 114 | 208 |
| colon | b | 6226-CT-0015 | 42 | 76 |
| edta_blood | c | 6226-CT-0016 | 39 | 70 |
| edta_blood | d | 6226-CT-0017 | 40 | 72 |
| cerebellum | b | 6226-CT-0018 | 87 | 163 |
This data can be used to estimate the a relative copy number across tissues by normalising mtDNA:nuclear coverage. This represents only a relative copy number estimate of the average in the tissue (i.e., different cell types within a tissue have different mtDNA copy numbers).
summTable_ont <- summTable %>% select(tissue, individual_id, sample_id, mean_cov_MT = mean_cov) %>%
full_join(summTable22 %>% select(tissue, individual_id, sample_id, mean_cov_chr22 = mean_cov),
by = join_by(tissue, individual_id, sample_id)) %>%
mutate(relative_copynumber = round(mean_cov_MT/mean_cov_chr22))
summTable_ont %>%
kable(caption = "Relative mtDNA copy number based on avg. depth of coverage") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| tissue | individual_id | sample_id | mean_cov_MT | mean_cov_chr22 | relative_copynumber |
|---|---|---|---|---|---|
| PFC | a | 6226-CT-0010 | 7456 | 63 | 118 |
| heart | b | 6226-CT-0011 | 199966 | 111 | 1801 |
| muscle | b | 6226-CT-0012 | 17489 | 81 | 216 |
| liver | b | 6226-CT-0013 | 18876 | 60 | 315 |
| kidney | b | 6226-CT-0014 | 12790 | 114 | 112 |
| colon | b | 6226-CT-0015 | 3321 | 42 | 79 |
| edta_blood | c | 6226-CT-0016 | 436 | 39 | 11 |
| edta_blood | d | 6226-CT-0017 | 452 | 40 | 11 |
| cerebellum | b | 6226-CT-0018 | 3705 | 87 | 43 |
Is the mtDNA copy number estimation similar in Illumina-based sequencing? We can use the samples for which we have both ONT and Illumina to investigate this.
depthFiles_wgs_1 <- paste0("../results/depth/",
filter(Metadata, platform == "illumina")$sample_id,
"_chrM.txt")
names(depthFiles_wgs_1) <- filter(Metadata, platform == "illumina")$sample_id
#all(file.exists(depthFiles_wgs_1))
depthFiles_wgs_2 <- paste0("../results/depth/",
filter(Metadata, platform == "illumina")$sample_id,
"_chrM_shifted.txt")
names(depthFiles_wgs_2) <- filter(Metadata, platform == "illumina")$sample_id
#all(file.exists(depthFiles_wgs_2))
depthFiles_wgs_22 <- paste0("../results/depth/",
filter(Metadata, platform == "illumina")$sample_id,
"_chr22.txt")
names(depthFiles_wgs_22) <- filter(Metadata, platform == "illumina")$sample_id
#all(file.exists(depthFiles_wgs_22))
## COVERAGE / SUMMARY STATS
Cov_wgs <- bind_rows(
Cov_wgs_1 <- lapply(names(depthFiles_wgs_1), function(sid) {
read_tsv(depthFiles_wgs_1[sid], col_names = FALSE, col_types = "cii") %>%
mutate(sample_id = sid) %>%
select(Pos = X2, Cov = X3, sample_id) %>%
filter(Pos >= 4000, Pos < 12000)
}) %>% Reduce("bind_rows", .),
Cov_wgs_2 <- lapply(names(depthFiles_wgs_2), function(sid) {
read_tsv(depthFiles_wgs_2[sid], col_names = FALSE, col_types = "cii") %>%
mutate(sample_id = sid) %>%
select(Pos = X2, Cov = X3, sample_id) %>%
filter(Pos >= 4000, Pos < 12569) %>%
mutate(Pos = sapply(Pos, us8k))
})
) %>% filter(Pos != 3107) %>%
arrange(sample_id, Pos)
Cov_wgs_22 <- lapply(names(depthFiles_wgs_22), function(sid) {
read_tsv(depthFiles_wgs_22[sid], col_names = FALSE, col_types = "cii") %>%
mutate(sample_id = sid) %>%
select(Pos = X2, Cov = X3, sample_id)
}) %>% Reduce("bind_rows", .)
summTable_wgs <- left_join(Cov_wgs, Metadata, by = join_by(sample_id)) %>%
group_by(sample_id) %>%
summarise(mean_cov = round(mean(Cov)),
stdev_cov = round(sd(Cov))) %>%
ungroup() %>%
left_join(Metadata, by = join_by(sample_id)) %>%
select(tissue, individual_id, sample_id, mean_cov, stdev_cov)
summTable_wgs_22 <- left_join(Cov_wgs_22, Metadata, by = join_by(sample_id)) %>%
group_by(sample_id) %>%
summarise(mean_cov = round(mean(Cov)),
stdev_cov = round(sd(Cov))) %>%
ungroup() %>%
left_join(Metadata, by = join_by(sample_id)) %>%
select(tissue, individual_id, sample_id, mean_cov, stdev_cov)
summTable_illumina <- summTable_wgs %>% select(tissue, individual_id, sample_id, mean_cov_MT = mean_cov) %>%
full_join(summTable_wgs_22 %>% select(tissue, individual_id, sample_id, mean_cov_chr22 = mean_cov),
by = join_by(tissue, individual_id, sample_id)) %>%
mutate(relative_copynumber = round(mean_cov_MT/mean_cov_chr22))
bind_rows(summTable_illumina, summTable_ont) %>% filter(tissue == "edta_blood") %>%
left_join(Metadata %>% select(sample_id, platform), by = join_by(sample_id)) %>%
select(individual_id, platform, cov_MT = mean_cov_MT,
cov_chr22 = mean_cov_chr22, copy_nr = relative_copynumber) %>%
pivot_wider(names_from = platform, values_from = c(cov_MT, cov_chr22, copy_nr)) %>%
kable(caption = "Relative mtDNA copy number in EDTA blood samples (ONT and Illumina)") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| individual_id | cov_MT_illumina | cov_MT_ont | cov_chr22_illumina | cov_chr22_ont | copy_nr_illumina | copy_nr_ont |
|---|---|---|---|---|---|---|
| c | 6325 | 436 | 46 | 39 | 138 | 11 |
| d | 6433 | 452 | 45 | 40 | 143 | 11 |
These results indicate a stark contrast between the two sequencing technologies in their relative coverages of the nuclear and mitochondrial genomes. According to the typically observed values (around 100, source), it seems that Illumina is more accurate and ONT is biased towards nuclear DNA.
Due to a higher error rate in long-read technology (at least compared to Illumina NGS sequencing technology), it is not possible to confidently estimate mutational loads. However, it is still possible to study relative mutational loads across tissues.
First, we can compare the error rate of ONT and Illumina sequencing platfoms by leveraging on the EDTA blood samples from the same 2 individuals. The sequencing error rate is here estimated using the chr22:20,000,000-20,100,000 region, which should only exhibit diploid genotypes. Only positions with a coverage within two standard deviations of the mean coverage are considered. We then calculate the number of erroneous calls as the bases that are different from the genotype.
The following plots represent this error along the genomic reference studied for the four samples (top samples, ONT, bottom samples, Illumina). Missing regions correspond to low coverage. The discrete nature of the coverage results in the “steps” seen in the plots.
In the case of the Illumina samples, some small regions seem to be problematic for both samples, suggesting that sequence composition is playing a role.
perNucFiles <- paste0("../results/depth/",
filter(Metadata, tissue == "edta_blood")$sample_id,
"_per_nuc_depth_chr22.txt")
names(perNucFiles) <- filter(Metadata, tissue == "edta_blood")$sample_id
#all(file.exists(perNucFiles))
perNucCov <- lapply(names(perNucFiles), function(sid) {
read_tsv(perNucFiles[sid], col_select = c(1,2,4,6,8,10),
c("seq", "pos", "a", "A", "c", "C", "g", "G", "t", "T")) %>%
mutate(sample_id = sid)
}) %>% Reduce("bind_rows", .)
perNucCov_filt <- perNucCov %>% mutate(Cov = A + C + G + T) %>%
group_by(sample_id) %>%
mutate(mean_cov = mean(Cov), std_cov_2 = 2*sd(Cov)) %>%
mutate(keep = ifelse(Cov > (mean_cov - std_cov_2) & Cov < (mean_cov + std_cov_2), TRUE, FALSE)) %>%
filter(keep, Cov > 0) %>%
select(-mean_cov, -std_cov_2, -keep) %>%
ungroup()
errors_per_base <- perNucCov_filt %>%
pivot_longer(A:T, names_to = "base", values_to = "count") %>%
arrange(sample_id, pos, desc(count)) %>%
group_by(sample_id, pos) %>%
mutate(hom_GT = base[1], het_GT = ifelse(count[2] > 0, base[2], NA)) %>%
mutate(hom_right = count[1]) %>%
mutate(het_right = ifelse(!is.na(het_GT), min(Cov[1]/2,count[1])+min(Cov[1]/2,count[2]), NA)) %>%
select(sample_id, pos, Cov, hom_right, het_right) %>%
distinct() %>%
mutate(right = pmax(hom_right, het_right, na.rm = TRUE)) %>%
mutate(wrong = Cov - right) %>%
select(-hom_right, -het_right)
Plot3 <- errors_per_base %>%
ggplot(aes(x = pos, y = wrong/Cov)) +
geom_point(size = 1) +
facet_grid(sample_id ~ .)
Plot3invisible(png("./Figs/GT_dispersion.png", height = 2000, width = 6000, res = 150))
Plot3
invisible(dev.off())The following table shows the error rate calculated based on this region.
errors_per_base %>% group_by(sample_id) %>%
summarise(error_rate = formatC(sum(wrong)/sum(Cov), digits = 2)) %>%
left_join(Metadata %>% select(sample_id, platform)) %>%
kable(caption = "Error rate in chr22:20,000,000-20,100,000") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| sample_id | error_rate | platform |
|---|---|---|
| 6226-CT-0016 | 0.052 | ont |
| 6226-CT-0017 | 0.054 | ont |
| SL443862 | 0.0035 | illumina |
| SL443863 | 0.0037 | illumina |
Due to the high error rate of the ONT technology, it is not possible to calculate an absolute mutational load in the mtDNA at the single molecule level. However, it is still possible to assess differences in mutational load across tissues.
To estimate the mutational load M, we calculate the proportion of basecalls per position that are different from the genotype at the position. Then, we logit-transform these values (which would be ranging from 0 to 1) so that they are normally distributed M = log2(e/(1-e)).
perNucFiles2 <- paste0("../results/depth/",
filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id,
"_per_nuc_depth_chrM.txt")
names(perNucFiles2) <- filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id
#all(file.exists(perNucFiles2))
perNucCov2 <- lapply(names(perNucFiles2), function(sid) {
read_tsv(perNucFiles2[sid], col_select = c(1,2,4,6,8,10),
c("seq", "pos", "a", "A", "c", "C", "g", "G", "t", "T")) %>%
mutate(sample_id = sid)
}) %>% Reduce("bind_rows", .) %>%
mutate(Cov = A + C + G + T)
mut_load <- perNucCov2 %>%
pivot_longer(A:T, names_to = "base", values_to = "count") %>%
arrange(sample_id, pos, desc(count)) %>%
group_by(sample_id, pos) %>%
mutate(hom_GT = base[1]) %>%
mutate(right = count[1]) %>%
select(sample_id, pos, Cov, right) %>%
distinct() %>%
filter(pos != 3107) %>%
left_join(select(Metadata, sample_id, tissue), by = join_by(sample_id)) %>%
mutate(mut_load = (Cov - right)/Cov)
mut_load$mut_load <- transf.betareg(mut_load$mut_load)
mut_load$M <- log2(mut_load$mut_load / (1 - mut_load$mut_load))
Plot4_dist <- mut_load %>% ggplot(aes(x = M)) +
geom_density(aes(colour = tissue)) +
ggtitle("mtDNA non-GT basecall proportion") +
xlim(-10, 0)
#Plot4 <- mut_load %>%
# ggplot(aes(x = pos, y = M)) +
# geom_point() +
# facet_grid(sample_id ~ .)
Plot4_dist#Plot4
invisible(png("./Figs/MT_M_dist.png", height = 1000, width = 1500, res = 150))
Plot4_dist
invisible(dev.off())
#invisible(png("./Figs/MT_M_load.png", height = 2000, width = 6000, res = 150))
#Plot4
#invisible(dev.off())However, this is just an average mutational load in the entire mtDNA population. We can now focus on the distribution of “errors” per read, i.e., the proportion of mismatches:matches in the read alignment. We use a logit transformation of the proportion of mismatches in the read:
M = logit(mismatches/(mismatches+matches))
snvsPerReadFiles <- paste0("../results/snvs_per_read/",
filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id,
".tsv")
names(snvsPerReadFiles) <- filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id
#all(file.exists(snvsPerReadFiles))
snvsPerRead <- lapply(names(snvsPerReadFiles), function(sid) {
#message(sid)
read_tsv(snvsPerReadFiles[sid], col_types = "cii") %>%
group_by(read_id) %>%
slice_max(matches, n = 1, with_ties = FALSE) %>%
ungroup() %>%
mutate(sample_id = sid)
}) %>% Reduce("bind_rows", .) %>%
left_join(select(Metadata, sample_id, tissue), by = join_by(sample_id)) %>%
mutate(mut_load = mismatches / (matches + mismatches))
snvsPerRead$mut_load <- transf.betareg(snvsPerRead$mut_load)
snvsPerRead$M <- log2(snvsPerRead$mut_load / (1 - snvsPerRead$mut_load))
Plot5 <- snvsPerRead %>% filter(M > -10) %>%
ggplot() +
geom_density(aes(x = M, colour = tissue)) +
geom_vline(xintercept = -3.5, linetype = 2) +
ggtitle("mtDNA non-reference basecall proportion PER READ") +
xlim(-10, 0)
#Plot6 <- snvsPerRead %>% filter(M > -10) %>%
# ggplot() +
# geom_point(aes(x = M, y = matches + mismatches)) +
# facet_grid(sample_id~.)
Plot5invisible(png("./Figs/MT_M_perRead_dist.png", height = 1000, width = 1500, res = 150))
Plot5
invisible(dev.off())Using M as a proxy for the mutational load in the mtDNA sequence that was the template for the read, we observerve that the mtDNA molecules exhibit a marked bimodal distribution. A population of mtDNA molecules harbours more mutations than the prevailing mtDNA population. The threshold seems to be around 8% error rate (M = -3.5). See dashed vertical line above.
Strikingly, the relative proportions of these mtDNA species are highly tissue-specific.
snvsPerRead %>%
mutate(high_mut_load = ifelse(M > -2.8, TRUE, FALSE)) %>%
group_by(sample_id, tissue) %>%
summarise(N = n(), nb_high_mut = sum(high_mut_load), prop_high_mut = nb_high_mut / N, .groups = "drop") %>%
mutate(prop_high_mut = formatC(prop_high_mut, digits = 2)) %>%
kable(caption = "Relative proportions of high-mutant mtDNA reads") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| sample_id | tissue | N | nb_high_mut | prop_high_mut |
|---|---|---|---|---|
| 6226-CT-0010 | PFC | 24612 | 2305 | 0.094 |
| 6226-CT-0011 | heart | 653011 | 3169 | 0.0049 |
| 6226-CT-0012 | muscle | 55895 | 2754 | 0.049 |
| 6226-CT-0013 | liver | 212363 | 1624 | 0.0076 |
| 6226-CT-0014 | kidney | 60584 | 3626 | 0.06 |
| 6226-CT-0015 | colon | 15712 | 1210 | 0.077 |
| 6226-CT-0018 | cerebellum | 14415 | 3122 | 0.22 |
These highly-mutated mtDNA sequences tend to be smaller than the prevailing mtDNA population. The following table tests for each sample (tissue) the sizes of the two groups of mtDNA molecules, those with more than 12% mutational load and those with less.
gc_mt <- left_join(snvsPerRead,
lapply(unique(snvsPerRead$sample_id), function(sid) {
#message(sid)
read_tsv(paste0("/data1/romain/ont/results/gc_content/", sid, "_chrM.tsv"), col_types = "cd") %>%
group_by(read_id) %>%
slice_head(n = 1) %>%
ungroup() %>%
mutate(sample_id = sid)
}) %>% Reduce("bind_rows", .), by = join_by(read_id, sample_id))
Data <- gc_mt %>%
mutate(high_mut_load = as.integer(ifelse(M > -2.8, 1, 0))) %>%
mutate(len = matches + mismatches)
lapply(unique(snvsPerRead$sample_id), function(sid) {
glm(high_mut_load ~ len, data = filter(Data, sample_id == sid) %>% mutate(len = scale2(len)), family = "binomial") %>%
broom::tidy() %>% filter(term != "(Intercept)") %>%
mutate(sample_id = sid) %>%
select(sample_id, everything())
}) %>% Reduce("bind_rows", .) %>%
select(sample_id, term, estimate, statistic, p.value) %>%
mutate(signif = stars.pvalue(p.value)) %>%
mutate(estimate = formatC(estimate, digits = 2), statistic = formatC(statistic, digits = 2), p.value = formatC(p.value, digits = 2)) %>%
kable(caption = "Association between read length and highly-mutated mtDNA") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| sample_id | term | estimate | statistic | p.value | signif |
|---|---|---|---|---|---|
| 6226-CT-0010 | len | -1.2 | -30 | 2e-192 | *** |
| 6226-CT-0011 | len | -2.3 | -47 | 0 | *** |
| 6226-CT-0012 | len | -1.7 | -48 | 0 | *** |
| 6226-CT-0013 | len | -0.081 | -3 | 0.0027 | ** |
| 6226-CT-0014 | len | -0.8 | -31 | 3.7e-211 | *** |
| 6226-CT-0015 | len | -1.3 | -25 | 7.1e-143 | *** |
| 6226-CT-0018 | len | -0.98 | -31 | 1.4e-215 | *** |
Plot7 <- Data %>% ggplot(aes(x = as.factor(high_mut_load), y = len)) +
#geom_violin() +
geom_boxplot(outlier.shape = NA) +
facet_grid(~tissue)
Plot7invisible(png("./Figs/MT_mutator_species.png", height = 1000, width = 1500, res = 150))
Plot7
invisible(dev.off())It is unlikely that this is a sequencing artifact, since different tissues present different proportions of these “high mutator” mtDNA species. To confirm this, we can test whether chr22 also has a similar bimodal distribution.
snvsPerReadFiles_chr22 <- paste0("../results/snvs_per_read/",
filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id,
"_chr22.tsv")
names(snvsPerReadFiles_chr22) <- filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id
#all(file.exists(snvsPerReadFiles_chr22))
snvsPerRead_chr22 <- lapply(names(snvsPerReadFiles_chr22), function(sid) {
#message(sid)
read_tsv(snvsPerReadFiles_chr22[sid], col_types = "cii") %>%
mutate(sample_id = sid)
}) %>% Reduce("bind_rows", .) %>%
mutate(mut_load = mismatches / (matches + mismatches)) %>%
left_join(select(Metadata, sample_id, tissue), by = join_by(sample_id))
snvsPerRead_chr22$mut_load <- transf.betareg(snvsPerRead_chr22$mut_load)
snvsPerRead_chr22$M <- log2(snvsPerRead_chr22$mut_load / (1 - snvsPerRead_chr22$mut_load))
Plot8 <- bind_rows(snvsPerRead %>% mutate(seqnames = "mtDNA"),
snvsPerRead_chr22 %>% mutate(seqnames = "chr22") %>%
select(read_id, matches:sample_id, tissue, mut_load, M, seqnames)) %>%
filter(M > -10) %>%
ggplot() +
geom_density(aes(x = M, colour = tissue)) +
facet_grid(seqnames~.) +
geom_vline(xintercept = -3.5, linetype = 2) +
ggtitle("mtDNA non-reference basecall proportion PER READ")
Plot8invisible(png("./Figs/chr22_M_perRead_dist.png", height = 1800, width = 1500, res = 150))
Plot8
invisible(dev.off())We also test now whether the GC content is different between these different mtDNA populations.
lapply(unique(snvsPerRead$sample_id), function(sid) {
glm(high_mut_load ~ gc_proportion, data = filter(Data, sample_id == sid) %>%
mutate(gc_proportion = scale2(gc_proportion)),
family = "binomial") %>%
broom::tidy() %>% filter(term != "(Intercept)") %>%
mutate(sample_id = sid) %>%
select(sample_id, everything())
}) %>% Reduce("bind_rows", .) %>%
select(sample_id, term, estimate, statistic, p.value) %>%
mutate(signif = stars.pvalue(p.value)) %>%
mutate(estimate = formatC(estimate, digits = 2), statistic = formatC(statistic, digits = 2), p.value = formatC(p.value, digits = 2)) %>%
kable(caption = "Association between GC proportion and highly-mutated mtDNA") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| sample_id | term | estimate | statistic | p.value | signif |
|---|---|---|---|---|---|
| 6226-CT-0010 | gc_proportion | -1.2 | -57 | 0 | *** |
| 6226-CT-0011 | gc_proportion | -0.94 | -1.1e+02 | 0 | *** |
| 6226-CT-0012 | gc_proportion | -1.2 | -75 | 0 | *** |
| 6226-CT-0013 | gc_proportion | -0.94 | -45 | 0 | *** |
| 6226-CT-0014 | gc_proportion | -1.1 | -74 | 0 | *** |
| 6226-CT-0015 | gc_proportion | -1.3 | -43 | 0 | *** |
| 6226-CT-0018 | gc_proportion | -1.3 | -51 | 0 | *** |
Plot9 <- Data %>% ggplot(aes(x = as.factor(high_mut_load), y = gc_proportion)) +
geom_boxplot(outlier.shape = NA) +
ylim(0.25, 0.55) +
facet_grid(~tissue)
Plot9invisible(png("./Figs/MT_mutator_species_GC_content.png", height = 1000, width = 1500, res = 150))
Plot9
invisible(dev.off())In order to characterize better this population of highly-mutated mtDNA sequences, we run a generalized linear model where the mtDNA class (highly mutated, 1 or not highly mutated, 0) is predicted simultaneously by the GC proportion (as before), average sequencing quality of the read, number of deletions in the aligned read (normalised by its length), and proportion of the aligned read made up of deletion (i.e., the total sum of the deletion lengths in the alignment divided by the read length).
highly mutated mtDNA ~ GC_proportion + Avg_read_qual + n_deletions + prop_del_bases
(All covariates are standardized to render coefficients comparable).
delFiles1 <- paste0("../results/deletions/",
filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id,
"_chrM.txt")
names(delFiles1) <- filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id
#all(file.exists(delFiles1))
delFiles2 <- paste0("../results/deletions/",
filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id,
"_chrM_shifted.txt")
names(delFiles2) <- filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id
#all(file.exists(delFiles2))
Del1 <- lapply(names(delFiles1), function(sid) {
#message(sid)
read_tsv(delFiles1[sid], col_types = "ciiid") %>%
mutate(sample_id = sid)
}) %>% Reduce("bind_rows", .) %>%
filter(del_end - del_start > 1)
Del2 <- lapply(names(delFiles2), function(sid) {
#message(sid)
read_tsv(delFiles2[sid], col_types = "ciiid") %>%
mutate(sample_id = sid)
}) %>% Reduce("bind_rows", .) %>%
filter(del_end - del_start > 1)
Del2$del_start <- sapply(Del2$del_start, us8k)
Del2$del_end <- sapply(Del2$del_end, us8k)
Del <- bind_rows(Del1, Del2) %>%
group_by(sample_id, read_id, del_start, del_end, read_length, avg_qual) %>%
summarise(support = n(), .groups = "drop")
rm(Del1, Del2)
invisible(gc())nb_dels <- Del %>%
mutate(del_size = ifelse(del_start <= del_end,
del_end - del_start,
del_end + (16569 - del_start))) %>%
group_by(sample_id, read_id, read_length, avg_qual) %>%
summarise(n_dels = n(), n_deleted_bases = sum(del_size),
median_del_size = median(del_size),
.groups = "drop") %>%
mutate(n_dels_per_base = n_dels / read_length,
n_deleted_bases_per_base = n_deleted_bases / read_length) %>%
group_by(sample_id, read_id) %>%
slice_max(read_length, n = 1) %>%
ungroup()
Data <- inner_join(Data %>% select(sample_id, tissue, read_id = read_id, everything()), nb_dels, by = join_by(sample_id, read_id))
Tests <- lapply(unique(snvsPerRead$sample_id), function(sid) {
glm(high_mut_load ~ gc_proportion + avg_qual + n_dels_per_base + n_deleted_bases_per_base + len,
data = filter(Data, sample_id == sid) %>%
mutate(gc_proportion = scale2(gc_proportion),
avg_qual = scale2(avg_qual),
n_dels_per_base = scale2(n_dels_per_base),
len = scale2(len),
n_deleted_bases_per_base = scale2(n_deleted_bases_per_base)),
family = "binomial") %>%
broom::tidy() %>% filter(term != "(Intercept)") %>%
mutate(sample_id = sid) %>%
select(sample_id, everything())
})
for (Test in Tests) {
Test %>% select(sample_id, term, estimate, statistic, p.value) %>%
mutate(signif = stars.pvalue(p.value)) %>%
mutate(estimate = formatC(estimate, digits = 3),
statistic = formatC(statistic, digits = 3),
p.value = formatC(p.value, digits = 3)) %>%
kable(caption = paste0(Test$sample_id[1], " (", Metadata[Metadata$sample_id == Test$sample_id[1],]$tissue[1], ")")) %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE) %>%
print()
cat("\n")
}| sample_id | term | estimate | statistic | p.value | signif |
|---|---|---|---|---|---|
| 6226-CT-0010 | gc_proportion | -1.09 | -43.8 | 0 | *** |
| 6226-CT-0010 | avg_qual | 0.0315 | 1.13 | 0.259 | |
| 6226-CT-0010 | n_dels_per_base | -0.143 | -4.47 | 7.95e-06 | *** |
| 6226-CT-0010 | n_deleted_bases_per_base | 0.0109 | 0.327 | 0.743 | |
| 6226-CT-0010 | len | -0.492 | -11.9 | 8.82e-33 | *** |
| sample_id | term | estimate | statistic | p.value | signif |
|---|---|---|---|---|---|
| 6226-CT-0011 | gc_proportion | -0.925 | -85.1 | 0 | *** |
| 6226-CT-0011 | avg_qual | 0.491 | 23 | 8.45e-117 | *** |
| 6226-CT-0011 | n_dels_per_base | 0.557 | 34.8 | 2.54e-265 | *** |
| 6226-CT-0011 | n_deleted_bases_per_base | 0.0359 | 3.25 | 0.00117 | ** |
| 6226-CT-0011 | len | -1.2 | -26.6 | 1.92e-156 | *** |
| sample_id | term | estimate | statistic | p.value | signif |
|---|---|---|---|---|---|
| 6226-CT-0012 | gc_proportion | -1.09 | -60.1 | 0 | *** |
| 6226-CT-0012 | avg_qual | 0.241 | 8.87 | 7.62e-19 | *** |
| 6226-CT-0012 | n_dels_per_base | 0.326 | 13.7 | 8.35e-43 | *** |
| 6226-CT-0012 | n_deleted_bases_per_base | 0.00639 | 0.184 | 0.854 | |
| 6226-CT-0012 | len | -1.19 | -29.3 | 3.58e-188 | *** |
| sample_id | term | estimate | statistic | p.value | signif |
|---|---|---|---|---|---|
| 6226-CT-0013 | gc_proportion | -1.28 | -49.9 | 0 | *** |
| 6226-CT-0013 | avg_qual | 0.488 | 16.8 | 2.1e-63 | *** |
| 6226-CT-0013 | n_dels_per_base | 0.765 | 32 | 1.44e-224 | *** |
| 6226-CT-0013 | n_deleted_bases_per_base | 0.0168 | 1.49 | 0.135 | |
| 6226-CT-0013 | len | 0.0283 | 0.913 | 0.361 |
| sample_id | term | estimate | statistic | p.value | signif |
|---|---|---|---|---|---|
| 6226-CT-0014 | gc_proportion | -1.23 | -68.8 | 0 | *** |
| 6226-CT-0014 | avg_qual | 0.331 | 15.2 | 3.7e-52 | *** |
| 6226-CT-0014 | n_dels_per_base | 0.377 | 16.7 | 1.92e-62 | *** |
| 6226-CT-0014 | n_deleted_bases_per_base | 0.0573 | 3.51 | 0.000452 | *** |
| 6226-CT-0014 | len | -0.383 | -13.5 | 2.21e-41 | *** |
| sample_id | term | estimate | statistic | p.value | signif |
|---|---|---|---|---|---|
| 6226-CT-0015 | gc_proportion | -1.28 | -38.6 | 0 | *** |
| 6226-CT-0015 | avg_qual | 0.355 | 8.58 | 9.69e-18 | *** |
| 6226-CT-0015 | n_dels_per_base | 0.358 | 6.86 | 6.89e-12 | *** |
| 6226-CT-0015 | n_deleted_bases_per_base | 0.16 | 3.66 | 0.000251 | *** |
| 6226-CT-0015 | len | -0.91 | -14.7 | 1.24e-48 | *** |
| sample_id | term | estimate | statistic | p.value | signif |
|---|---|---|---|---|---|
| 6226-CT-0018 | gc_proportion | -1.27 | -44.7 | 0 | *** |
| 6226-CT-0018 | avg_qual | 0.167 | 6 | 2.02e-09 | *** |
| 6226-CT-0018 | n_dels_per_base | 0.114 | 3.78 | 0.000158 | *** |
| 6226-CT-0018 | n_deleted_bases_per_base | 0.0385 | 1.82 | 0.0692 | . |
| 6226-CT-0018 | len | -0.636 | -16.4 | 3.83e-60 | *** |
These results indicate that these highly-mutated mtDNA sequences have significantly lower GC content than the main mtDNA population in all samples. Interestingly, the average sequencing quality is higher in the highly-mutated mtDNA sequences, supporting the notion that these are not just sequencing artifacts. This is observed in all samples with the exception of the PFC, where this was not significant. The number of deletion events was also significantly associated with the high mutational load in every sample, although without a consistent direction of association. While in most samples the number of deletions was hight in highly-mutated mtDNA reads, the two brain tissues (cerebellum and PFC) exhibited the opposing trend, i.e., a decreased number od deletion events in highly-mutated mtDNA reads. The size of the mutation events had a smaller association with the highly-mutated status, and was only statistically significant in kidney and colon, where it was positively associated. Finally, the read size was negatively associated with highly-mutated mtDNA, i.e., longer sequences tend to have lower mutation frequencies (with the exception of liver, where the association was not statistically significant).
## varFiles <- paste0("../results/", filter(Metadata, platform == "ont")$sample_id, "_chrM_combined_filt_splt.vcf")
## names(varFiles) <- filter(Metadata, platform == "ont")$sample_id
##
## #varFiles1 <- paste0("../results/", filter(Metadata, platform == "ont")$sample_id, "_chrM_variants.txt")
## #names(varFiles1) <- filter(Metadata, platform == "ont")$sample_id
##
## #varFiles2 <- paste0("../results/", filter(Metadata, platform == "ont")$sample_id, "_chrM_shifted_variants.txt")
## #names(varFiles2) <- filter(Metadata, platform == "ont")$sample_id
##
##
##
##
## #########################
## ##### READ VARIANTS #####
## #########################
##
## require("vcfR")
##
## # VCF files generated using Mutect2 are already fixed in terms of coordinates
## chrM_vars <- lapply(names(varFiles), function(sid) {
## File <- varFiles[sid]
## read_tsv(File, col_types = "cciccdcdcdiiiid") %>% filter(Filter == "PASS") %>%
## filter(Pos >= 4000, Pos < 12000) %>%
## filter(Type %in% c(1,2)) %>%
## mutate(sample_id = sid) %>%
## select(-ID, -Filter)
## }) %>% Reduce("bind_rows", .)
##
##
##
##
##
## # read chrM pos. 4000-11999 from normal chrM alignments
## chrM_vars <- lapply(names(varFiles1), function(sid) {
## File <- varFiles1[sid]
## read_tsv(File, col_types = "cciccdcdcdiiiid") %>% filter(Filter == "PASS") %>%
## filter(Pos >= 4000, Pos < 12000) %>%
## filter(Type %in% c(1,2)) %>%
## mutate(sample_id = sid) %>%
## select(-ID, -Filter)
## }) %>% Reduce("bind_rows", .)
##
## # read chrM pos. 1-3999 and 12000-16569 from alignments to shifted reference,
## # which corresponds to 8570-12568 and 4000-8569
## chrM_shift_vars <- lapply(names(varFiles2), function(sid) {
## File <- varFiles2[sid]
## read_tsv(File, col_types = "cciccdcdcdiiiid") %>% filter(Filter == "PASS") %>%
## filter(Pos >= 4000, Pos < 12569) %>%
## filter(Type %in% c(1,2)) %>%
## mutate(sample_id = sid) %>%
## mutate(Pos = sapply(Pos, us8k)) %>%
## select(-ID, -Filter)
## }) %>% Reduce("bind_rows", .)
##
## all_vars <- bind_rows(chrM_vars, chrM_shift_vars) %>%
## left_join(Metadata, by = join_by(sample_id)) %>%
## select(tissue,
## all_vars
## rm(chrM_vars, chrM_shift_vars)
## invisible(gc())
##
## all_vars %>% filter(tissue %in% c("edta_blood_1", "edta_blood_2"), Type == 1) %>%
## arrange(Pos)
## all_vars %>% filter(tissue %in% c("edta_blood_1", "edta_blood_2"), Type == 2) %>%
## filter(Pos == 6)
## arrange(Pos)A first rough estimate of the presence and abundance of deletions is the length of the mtDNA-aligned reads.
The density plots show the distribution of ONT read lengths aligned to either the mtDNA (rCRS) or the chr22. The vertical dashed line indicates the rCRS length (16,569).
We can first formally compare the sizes of the reads across tissues.
#require('rstatix')
#compare_means(readlength ~ tissue, data = Len %>%
# left_join(Metadata %>% select(sample_id, individual_id, tissue), by = join_by(sample_id)) %>%
# filter(individual_id %in% c("a", "b")))
Tissues <- Metadata %>% filter(individual_id %in% c("a", "b")) %>% pull(tissue)
my_comparisons <- lapply(1:(length(Tissues)-1), function(i) lapply((i+1):length(Tissues), function(j) return(c(Tissues[i],Tissues[j])))) %>%
unlist(recursive = F)
lenBoxPlot <- Len %>%
left_join(Metadata %>% select(sample_id, individual_id, tissue), by = join_by(sample_id)) %>%
ggplot(aes(x = tissue, y = log10(readlength))) +
geom_boxplot(outlier.shape = NA)
lenBoxPlotLen %>%
left_join(Metadata %>% select(sample_id, individual_id, tissue), by = join_by(sample_id)) %>%
mutate(log10_readlength = log10(readlength)) %>%
filter(!is.infinite(log10_readlength)) %>%
compare_means(formula = log10_readlength ~ tissue, data = ., comparisons = my_comparisons, label = "p.signif") %>%
kable(caption = "Comparisong read length across tissues") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| .y. | group1 | group2 | p | p.adj | p.format | p.signif | method |
|---|---|---|---|---|---|---|---|
| log10_readlength | PFC | heart | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | PFC | muscle | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | PFC | liver | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | PFC | kidney | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | PFC | colon | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | PFC | edta_blood | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | PFC | cerebellum | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | heart | muscle | 0.0076086 | 0.01500 | 0.0076 | ** | Wilcoxon |
| log10_readlength | heart | liver | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | heart | kidney | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | heart | colon | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | heart | edta_blood | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | heart | cerebellum | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | muscle | liver | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | muscle | kidney | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | muscle | colon | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | muscle | edta_blood | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | muscle | cerebellum | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | liver | kidney | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | liver | colon | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | liver | edta_blood | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | liver | cerebellum | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | kidney | colon | 0.0000448 | 0.00013 | 4.5e-05 | **** | Wilcoxon |
| log10_readlength | kidney | edta_blood | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | kidney | cerebellum | 0.2383921 | 0.24000 | 0.2384 | ns | Wilcoxon |
| log10_readlength | colon | edta_blood | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
| log10_readlength | colon | cerebellum | 0.0000000 | 0.00000 | 1.1e-08 | **** | Wilcoxon |
| log10_readlength | edta_blood | cerebellum | 0.0000000 | 0.00000 | < 2e-16 | **** | Wilcoxon |
lm(formula = log10_readlength ~ tissue, data = Len %>%
left_join(Metadata %>% select(sample_id, individual_id, tissue), by = join_by(sample_id)) %>%
mutate(log10_readlength = log10(readlength)) %>%
filter(!is.infinite(log10_readlength))) %>%
broom::tidy() %>%
kable(caption = "Linear model comparing read length across tissues") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 3.2682446 | 0.0020919 | 1562.35501 | 0 |
| tissuecolon | 0.0424773 | 0.0033535 | 12.66661 | 0 |
| tissueedta_blood | -0.1486640 | 0.0031622 | -47.01311 | 0 |
| tissueheart | 0.1554573 | 0.0021416 | 72.58932 | 0 |
| tissuekidney | 0.0488258 | 0.0025056 | 19.48694 | 0 |
| tissueliver | -0.0800692 | 0.0022859 | -35.02816 | 0 |
| tissuemuscle | 0.1109153 | 0.0025117 | 44.15936 | 0 |
| tissuePFC | 0.0573036 | 0.0028207 | 20.31568 | 0 |
Now we compare the distributions of deletion lengths ( >100) across tissues.
delBoxPlot <- Del %>%
mutate(del_size = ifelse(del_start <= del_end,
del_end - del_start,
del_end + (16569 - del_start))) %>%
filter(del_size >= 100) %>%
left_join(Metadata %>% select(sample_id, individual_id, tissue), by = join_by(sample_id)) %>%
ggplot(aes(x = tissue, y = log10(del_size))) +
geom_boxplot(outlier.shape = NA)
delBoxPlotlm(formula = log10_del_size ~ tissue + read_length, data =
Del %>%
mutate(del_size = ifelse(del_start <= del_end,
del_end - del_start,
del_end + (16569 - del_start))) %>%
filter(del_size >= 100) %>%
left_join(Metadata %>% select(sample_id, individual_id, tissue), by = join_by(sample_id)) %>%
mutate(log10_del_size = log10(del_size))) %>%
broom::tidy() %>%
kable(caption = "Linear model comparing deletion size across tissues") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2.2687847 | 0.0641501 | 35.3668172 | 0.0000000 |
| tissuecolon | -0.0334776 | 0.1107483 | -0.3022853 | 0.7624502 |
| tissueheart | 0.2290750 | 0.0617619 | 3.7090054 | 0.0002109 |
| tissuekidney | 0.0281485 | 0.0752157 | 0.3742370 | 0.7082478 |
| tissueliver | 0.1691763 | 0.0737804 | 2.2929712 | 0.0219011 |
| tissuemuscle | 0.5949035 | 0.0674216 | 8.8236385 | 0.0000000 |
| tissuePFC | 0.6128355 | 0.0717466 | 8.5416697 | 0.0000000 |
| read_length | 0.0000125 | 0.0000023 | 5.4242017 | 0.0000001 |
Here we plot the deletions larger than 100bp in the mtDNA, based on single reads from the ONT sequencing.
The plots are large and are, therefore, exported into svg images in the Figs/ folder. Only the PFC is plotted here as an example.
# mtDNA gene annotation
mt_genes_df <- readRDS("./Data/EnsDb.Hsapiens.v75.Rds") %>%
as_tibble() %>%
filter(seqnames == "MT") %>%
select(start, end, strand, gene_name, gene_biotype) %>%
unique() %>%
mutate(seqnames = factor("mtDNA")) %>%
mutate(gene_name = str_replace(gene_name, "^MT-", "")) %>%
mutate(drop = ifelse(gene_name %in% c("ATP6", "ND4", "TQ"), TRUE, FALSE)) %>%
mutate(big = ifelse(gene_name %in% c("RNR1", "RNR2", "ND1", "ND2",
"CO1", "CO2", "ATP8", "ATP6",
"CO3", "ND3", "ND4L", "ND4",
"ND5", "CYB"), TRUE, FALSE)) %>%
select(seqnames, everything())
#
#tibble(seqnames = "chrM", start = seq(1,16001, by = 100),
# end = c(seq(100, 16000, by = 100), 16569)) %>%
#write_tsv("./Data/ranges.bed", col_names = FALSE)
#
## bedtools nuc -fi /ref/MT/Homo_sapiens_assembly38.chrM.fasta \
## -bed analyses/Data/ranges.bed | \
## cut -f1-3,5 > analyses/Data/rCRS_GC_content.tsv
#
# GC content of the mtDNA sequence in 100bp windows
mt_gc_content <- read_tsv("./Data/rCRS_GC_content.tsv",
col_names = c("seqnames", "start", "end", "GC_prop"),
col_types = "ciid",
comment = "#") %>%
mutate(seqnames = factor("mtDNA"))
for (SID in unique(Del$sample_id)) {
invisible(svg(paste0("./Figs/circos_", SID, ".svg"), width = 10, height = 10))
circ_arcs(sid = SID, min_del_len = 100, alpha = 0.5)
invisible(dev.off())
invisible(svg(paste0("./Figs/del_cov_circos_", SID, ".svg"), width = 10, height = 10))
circ_del_cov(sid = SID, min_del_len = 100, alpha = 0.5)
invisible(dev.off())
}circ_del_cov(sid = "6226-CT-0010", min_del_len = 100, alpha = 0.5)